Eslo: From Transcription to Speakers' Personal Information Annotation
Abstract
This paper presents preliminary work on putting a French oral corpus and its transcription online. The corpus is the Socio-Linguistic Survey in Orleans, carried out in 1968. First, we digitized the corpus, then transcribed it manually with the Transcriber software, adding tags for speakers, time, noise, etc. Each document (audio file and XML transcription file) is described by a set of metadata stored in an XML format for easy consultation. Second, we added further levels of annotation: recognition of named entities and annotation of personal information about speakers. Both annotation tasks used the CasSys system of transducer cascades. We used and modified a first cascade to recognize named entities, then built a second cascade to annotate the designating entities, i.e. information about the speaker. This second cascade parses the corpus already annotated with named entities. The objective is to locate information about the speaker and also to determine what kind of information can designate him or her. Both cascades were evaluated with precision and recall measures.

1. Corpus presentation

The most extensive examination of spoken French before 1980 is the Socio-Linguistic Survey in Orleans (Enquête Socio-Linguistique à Orléans, ESLO 1). This investigation was carried out towards the end of the 1960s by British academics with a didactic aim. It comprises a collection of 200 interviews accompanied by reference information (sociological characterization of the interviewees, identification of the interviewer, date and place of the interview). The interviews were recorded in professional or private contexts, for a total of 300 hours of speech, and the corpus contains approximately 4,500,000 words.

Note that a new survey, ESLO 2, has been undertaken by the LLL-Orléans in order to build a corpus comparable in terms of data gathering and archiving. Together, ESLO 1 and ESLO 2 will form a collection of 700 hours of recordings, that is, more than 10,000,000 words.
In this article we show the different levels of corpus annotation. Starting from the transcription, which we regard as the first level of annotation, we then describe how to enrich the transcription with semantic annotations in XML format.

2. Preliminary work on ESLO

The ESLO project involves many steps: digitization, coding of metadata, transcription synchronized with the sound, annotation, anonymization, and query and dissemination tools. The main theoretical and technical choices made during the scientific exploitation of the ESLO1 corpus serve a precise objective: to contribute to the reflection on the evolution of models and methods for building and exploiting oral corpora for linguistic purposes [Abouda, Baude, 2007].

Part of ESLO1 was put online (http://bach.arts.kuleuven.be/elicop/) within the Elicop project [Mertens, 2002]. ESLO1 consisted of tape recordings, index cards of sociological characteristics, and descriptions of the recording situation (date, place, remarks on the acoustics, etc.). The audio files were digitized, and indexing and a first cataloguing were carried out. The objective is to make all data available to the scientific community in a format that allows optimal and intensive exploitation [Baude, 2006].
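The abstract states that both cascades were evaluated with precision and recall. As a reminder of what these two measures compute, here is a minimal sketch in Python. It assumes that annotations are compared as exact (start, end, label) spans; the function name, the span format, and the sample data are all hypothetical and are not taken from the ESLO evaluation.

```python
def precision_recall(predicted, gold):
    """Return (precision, recall) for two collections of annotations.

    An annotation counts as correct only if it matches a gold annotation
    exactly (assumption: exact-match scoring on (start, end, label) spans).
    """
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    # Precision: share of the system's annotations that are correct.
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: share of the reference annotations that the system found.
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical spans for illustration only; not ESLO data.
gold = {(0, 2, "person"), (5, 6, "location"),
        (9, 11, "organization"), (14, 15, "person")}
predicted = {(0, 2, "person"), (5, 6, "location"), (12, 13, "person")}

precision, recall = precision_recall(predicted, gold)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this toy case the system proposes three annotations, two of which are correct, while the reference contains four: precision is 2/3 and recall is 1/2. Real evaluations often also report partial matches, which this sketch does not handle.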
Journal title:
- CoRR
Volume abs/1111.3122, Issue
Pages -
Publication date 2010